Search for: All records

Creators/Authors contains: "Chatterjee, Abhijit"


  1. Online reinforcement learning (RL) based systems are increasingly deployed in safety-critical applications ranging from drone control to medical robotics. These systems typically run RL onboard rather than relying on remote operation from high-performance datacenters. Because of the dynamic environments they operate in, onboard RL hardware is vulnerable to soft errors from radiation, thermal effects and electrical noise that corrupt the results of computations. Existing approaches to online error resilience in machine learning systems rely on the availability of large training datasets to configure resilience parameters, which is not necessarily feasible for online RL systems. Similarly, approaches involving specialized hardware or modifications to training algorithms are difficult to implement for onboard RL applications. In contrast, we present a novel error resilience approach for online RL that uses running statistics collected across the (real-time) RL training process to configure error detection thresholds without access to a reference training dataset. In this methodology, statistical concentration bounds built on these running statistics are used to diagnose neuron outputs as erroneous. The outputs of neurons diagnosed as erroneous are then set to zero (suppressed). Our approach is compared against the state of the art and validated on several RL algorithms, using multiple concentration bounds, on CPU as well as GPU hardware.
    Free, publicly-accessible full text available July 3, 2024
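The detection-and-suppression idea in item 1 can be illustrated with a minimal sketch. It assumes Welford's online algorithm for the running statistics and the Chebyshev inequality as the concentration bound; the abstract names neither, and the class and function names (`RunningStats`, `chebyshev_k`, `suppress`) and the `warmup` parameter are hypothetical, not taken from the paper.

```python
import math


class RunningStats:
    """Running mean/variance of one neuron's output (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def chebyshev_k(p_false_alarm):
    """Chebyshev: P(|X - mean| >= k*std) <= 1/k^2, so choose k = 1/sqrt(p)."""
    return 1.0 / math.sqrt(p_false_alarm)


def suppress(outputs, stats, k, warmup=10):
    """Zero outputs outside mean +/- k*std; update stats with accepted values only."""
    cleaned = []
    for x, s in zip(outputs, stats):
        out_of_bound = s.n >= warmup and abs(x - s.mean) > k * s.std
        if not math.isfinite(x) or out_of_bound:
            cleaned.append(0.0)  # flagged as erroneous: suppress
        else:
            s.update(x)  # accepted: fold into the running statistics
            cleaned.append(x)
    return cleaned
```

No reference dataset is needed here: the thresholds come only from values observed so far. The warm-up count is an assumption of this sketch, added so that thresholds are not applied before the statistics are meaningful.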
  2. Spiking Neural Networks (SNNs) can be implemented with power-efficient digital as well as analog circuitry. However, in Resistive RAM (RRAM) based SNN accelerators, synapse weights programmed into the crossbar can differ from their ideal values due to defects and programming errors, degrading inference accuracy. In addition, circuit nonidealities within analog spiking neurons that alter the neuron spiking rate (modeled as variations in the neuron firing threshold) can degrade SNN inference accuracy when the number of inference time steps (ITSteps) of the SNN is set to a critical minimum that maximizes network throughput. We first develop a recursive linearized check to detect synapse weight errors with high sensitivity. A detected error triggers a correction methodology that sets out-of-range synapse values to zero. To correct the effects of firing threshold variations, we develop a test methodology that calibrates the extent of such variations. The calibration result is then used to increase the number of inference time steps proportionally for chips with higher variation. Experiments on a variety of SNNs demonstrate the viability of the proposed resilience methods.
    Free, publicly-accessible full text available May 29, 2024
  3. Transformer networks have achieved remarkable success in Natural Language Processing (NLP) and Computer Vision applications. However, the large volume of underlying Transformer computations demands high reliability and resilience to soft errors in processor hardware. The objective of this research is to develop efficient techniques for the design of error-resilient Transformer architectures. To enable this, we first perform a soft error vulnerability analysis of every fully connected layer in Transformer computations. Based on this study, error detection and suppression modules are selectively introduced into datapaths to restore Transformer performance under anticipated error rate conditions. Memory access errors and neuron output errors are detected using checksums of linear Transformer computations. Correction consists of identifying output neurons with out-of-range values and suppressing them to zero. For a Transformer with a nominal BLEU score of 52.7, such vulnerability-guided selective error suppression can recover language translation performance from a BLEU score of 0 to 50.774 with as much as 0.001 probability of activation error, incurring negligible memory and computation overheads.
    Free, publicly-accessible full text available May 22, 2024
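The checksum-based detection in item 3 can be sketched for a single fully connected layer. This is a generic algorithm-based fault tolerance check, assuming a layer of the form y = Wx: the sum of the outputs must equal the dot product of the column sums of W with x. The function names, the tolerance and the valid range `[lo, hi]` are hypothetical; the abstract does not say how its checksums or ranges are chosen.

```python
def matvec(W, x):
    """Plain matrix-vector product y = W x (W is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def column_sums(W):
    """Checksum vector c = 1^T W, computed once and stored with the layer."""
    return [sum(col) for col in zip(*W)]


def checksum_ok(y, c, x, tol=1e-6):
    """For an error-free layer, sum(y) equals c . x up to rounding."""
    expected = sum(ci * xi for ci, xi in zip(c, x))
    return abs(sum(y) - expected) <= tol * max(1.0, abs(expected))


def suppress_out_of_range(y, lo, hi):
    """Correction step: zero any output outside the assumed valid range."""
    return [v if lo <= v <= hi else 0.0 for v in y]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, -1.0]
c = column_sums(W)
y = matvec(W, x)
y_faulty = [y[0], 1e4, y[2]]  # emulate a soft error in one output
if not checksum_ok(y_faulty, c, x):
    y_faulty = suppress_out_of_range(y_faulty, lo=-10.0, hi=10.0)
```

The check costs one extra dot product per layer, which is in line with the low overhead the abstract reports, although the exact checksum scheme used there may differ.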
  4. The reliability of emerging neuromorphic compute fabrics is of great concern due to their widespread use in critical data-intensive applications. Ensuring such reliability is difficult due to the intensity of the underlying computations (billions of parameters), errors induced by low-power operation, and the complex relationship between errors in computations and their effect on network accuracy. We study the problem of designing error-resilient neuromorphic systems where errors can stem from: (a) soft errors in the computation of matrix-vector multiplications and neuron activations, (b) malicious trojan and adversarial security attacks and (c) manufacturing process variations in analog crossbar arrays that can affect DNN accuracy. The core principle of error detection relies on embedded predictive neuron checks using invariants derived from the statistics of nominal neuron activation patterns in the hidden layers of a neural network. Algorithmic encodings of hidden neuron function are also used to derive invariants for checking. A key contribution is the design of checks that are robust to the inherent nonlinearity of neuron computations with minimal impact on error detection coverage. Once errors are detected, they are corrected using probabilistic methods, because exact error diagnosis is difficult in such complex systems. The technique scales across soft errors as well as a range of security attacks. The effects of manufacturing process variations are handled through compact tests from which DNN performance can be assessed using learning techniques. Experimental results are presented on a variety of neuromorphic test systems: DNNs, spiking networks and hyperdimensional computing.
  5. Deep learning techniques have been widely adopted in daily life with applications ranging from face recognition to recommender systems. The substantial overhead of conventional error tolerance techniques precludes their widespread use, while approaches involving median filtering and invariant generation rely on alterations to DNN training that may be difficult to achieve for larger networks on larger datasets. To address this issue, this paper presents a novel approach that uses the statistics of neuron output gradients to identify and suppress erroneous neuron values. By using the statistics of neurons’ gradients with respect to their neighbors, tighter statistical thresholds are obtained than with neuron output values alone. The approach is modular and is combined with accurate, low-overhead error detection methods so that it is invoked only when needed, further reducing its cost. Deep learning models can be trained using standard methods, and our error correction module is then fitted to the trained DNN. It achieves comparable or superior performance relative to baseline error correction methods at comparable hardware overhead, without modifying DNN training or requiring specialized hardware architectures.
  6. null (Ed.)
  7. Safety is a critical component in today's autonomous and robotic systems. Many modern controllers endowed with guaranteed safety properties rely on accurate mathematical models of these nonlinear dynamical systems. However, model uncertainty is a persistent challenge that weakens theoretical guarantees and compromises safety, and the challenge is greater still for safety-critical systems. Typically, safety is ensured by constraining the system states within a safe constraint set defined a priori from the model of the system. A popular approach is to use Control Barrier Functions (CBFs), which encode safety using a smooth function. However, CBFs fail in the presence of model uncertainties. Moreover, an inaccurate model can lead to incorrect notions of safety or, worse, to system-critical failures. Addressing these drawbacks, we present a novel safety formulation that leverages properties of CBFs and positive definite kernels to design Gaussian CBFs. The underlying kernels are updated online by learning the unmodeled dynamics using Gaussian Processes (GPs). While CBFs guarantee forward invariance, the hyperparameters estimated using GPs update the kernel online and thereby adjust the relative notion of safety. We demonstrate the proposed technique in simulation on a safety-critical quadrotor on SO(3) in the presence of model uncertainty. With the kernel update performed online, safety is preserved for the system.
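As background for item 7, the standard CBF condition it builds on can be sketched for a toy system. This is not the Gaussian CBF or the GP update from the abstract. It assumes a one-dimensional single integrator x' = u with the safe set x <= x_max, the barrier h(x) = x_max - x, and the usual condition h' + alpha * h >= 0, which here reduces to u <= alpha * (x_max - x). All names and parameter values are hypothetical.

```python
def cbf_filter(x, u_nom, x_max=1.0, alpha=1.0):
    """Minimally modify u_nom so that h' + alpha*h >= 0 with h(x) = x_max - x."""
    h = x_max - x
    # For x' = u, h' = -u, so the CBF condition is u <= alpha * h.
    return min(u_nom, alpha * h)


def simulate(x0, u_nom, steps=1000, dt=0.01, x_max=1.0, alpha=1.0):
    """Forward-Euler rollout of x' = u under the filtered control."""
    x, traj = x0, [x0]
    for _ in range(steps):
        u = cbf_filter(x, u_nom, x_max, alpha)
        x += dt * u
        traj.append(x)
    return traj


# A nominal controller that pushes toward the boundary is slowed down so
# that the state approaches x_max without crossing it (needs dt * alpha <= 1).
traj = simulate(x0=0.0, u_nom=2.0)
```

In this toy case the model is exact, so the filter keeps the state inside the safe set. The abstract's point is that this guarantee degrades when the model is wrong, which motivates updating the barrier online from learned dynamics.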
  8. Modern 5G and projected 6G wireless systems deploy massive MIMO systems with antenna arrays and novel RF transceiver architectures that admit RF beamforming. Testing and tuning the underlying transceiver arrays on a per-transceiver basis is expensive and can be expedited through parallel testing and tuning techniques that stimulate the entire transceiver array concurrently. State-of-the-art parallel testing techniques require frequency separation between the tones applied to individual RF chains, because RF signals are combined before down-conversion in analog beamforming MIMO systems. Test schemes that allow some frequency overlap are limited to testing only third-order distortion. In this paper, we first present a parallel testing scheme for large MIMO transceiver arrays that is amenable to higher-order distortion (up to fifth order) in the RF chains considered. Second, we propose a tuning scheme for the entire MIMO array that implicitly tunes for EVM system specifications without explicit knowledge of the relationship between the system test response, the system tuning knobs and the corresponding EVM and SINR specification values. A cost metric is formulated that allows such a solution using reinforcement (multi-arm bandit) learning driven system tuning. Simulation experiments demonstrate significant yield improvement with this approach.
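The bandit-driven tuning in item 8 can be illustrated with a generic epsilon-greedy sketch. The abstract does not specify the bandit algorithm or the cost metric, so epsilon-greedy, the function `tune_with_bandit` and the toy reward `toy_reward` (a stand-in for a measured EVM-derived cost) are all assumptions made for illustration.

```python
import random


def tune_with_bandit(settings, reward_fn, rounds=200, epsilon=0.1, seed=0):
    """Epsilon-greedy search over discrete tuning-knob settings (the arms)."""
    if rounds < len(settings):
        raise ValueError("need at least one round per setting")
    rng = random.Random(seed)
    counts = [0] * len(settings)
    means = [0.0] * len(settings)  # running mean reward per setting
    for t in range(rounds):
        if t < len(settings):
            arm = t  # try every setting once first
        elif rng.random() < epsilon:
            arm = rng.randrange(len(settings))  # explore
        else:
            arm = max(range(len(settings)), key=means.__getitem__)  # exploit
        reward = reward_fn(settings[arm])
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    best = max(range(len(settings)), key=means.__getitem__)
    return settings[best], means


def toy_reward(knob):
    """Hypothetical stand-in for a test response: best at knob == 3."""
    return -float((knob - 3) ** 2)


best_setting, _ = tune_with_bandit([0, 1, 2, 3, 4, 5], toy_reward)
```

The tuner only observes a scalar reward for each setting it tries, which mirrors the abstract's claim that no explicit model relating knobs to EVM and SINR is required.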